Abstract: Automatic classification of scientific articles based on common
characteristics is an interesting problem with many applications in digital
library and information retrieval systems. Properly organized articles can
be useful for automatic generation of taxonomies in scientific writings,
textual summarization, efficient information retrieval, etc. Generating
article bundles from a large number of input articles, based on the associated
features of the articles, is a tedious and computationally expensive task. In
this report we propose an automatic two-step approach for topic extraction
and bundling of related articles from a set of scientific articles in real time.
For topic extraction, we use Latent Dirichlet Allocation (LDA) topic
modeling, and for bundling, we use hierarchical agglomerative
clustering.

We run experiments to validate our bundling semantics and compare
them with existing models in use. We use Amazon Mechanical Turk, an
online crowdsourcing marketplace provided by Amazon, to carry out the
experiments. We explain our experimental setup and empirical
results in detail and show that our method is advantageous over existing
ones.



1 Introduction 

With the advancement of information retrieval systems, especially search technologies,
finding relevant information about any topic under the sky is a relatively easy task. Search
engines like Google are very effective and popular for web retrieval. Researchers rely on
these search engines to gather related works relevant to their field of work. Most search
engines run dedicated services for scientific literature search; examples include
popular websites like Google Scholar [htt12c] and CiteSeerX [htt12b]. The websites
mentioned above are very competent and retrieve a large number of articles for a proper input
query. For example, our search for scholarly articles on the topic 'topic modeling' resulted
in 1,190,000 and 141,843 articles using Google Scholar and CiteSeerX respectively. These
results are ordered based on the indexing and ranking algorithms used by the underlying
search system and contain similar articles scattered over different pages.



Grouping or bundling the articles resulting from an extensive search into smaller coherent
groups is an interesting but difficult task. Even though many research studies have been
conducted in the area of data bundling, a concrete generalized algorithm does not exist.
Effective grouping of data requires a precise definition of closeness between a pair of data
items, and the notion of closeness always depends on the data and the problem context.
Closeness is defined in terms of the similarity of the data pairs, which in turn is measured in
terms of the dissimilarity or distance between pairs of items. In this report we use the terms
similarity, dissimilarity and distance to denote the measure of closeness between data items.
Most bundling schemes start by identifying the common attributes (metadata) of
the data set, here scientific articles, and create bundling semantics based on a combination
of these attributes. Here we suggest a two-step algorithm to bundle scientific articles.
In the first step we group articles based on the latent topics in the documents, and in the
second step we carry out agglomerative hierarchical clustering based on the inter-textual
distance and co-authorship similarity between articles. We run experiments to validate the
bundling semantics and to compare them with content-only similarity. We used 19937
articles related to Computer Science from arXiv [htt12a] for our experiments.



2 Topic Extraction 

2.1 Latent Dirichlet Allocation 

Latent Dirichlet Allocation (LDA) [BNJ03] is a probabilistic generative model for document
modeling. It is based on Probabilistic Latent Semantic Analysis (PLSA), a generative
model suggested by Thomas Hofmann in [LP99, Hof99]. LDA [BNJ03] is based on a
dimensionality reduction assumption and the bag-of-words assumption, i.e., the order of words in a
document is not important. Words are considered to be conditionally independent and
identically distributed. The ordering of documents is also neglected, and documents are assumed
to be independent and identically distributed. This is called document exchangeability. The same
principle applies to topics: there is no prior ordering of topics, which makes it identifiable.
The basic assumptions of the LDA model are given below:

• The number of documents is fixed

• The vocabulary size is fixed

• The number of topics is fixed

• The word distribution is a multinomial distribution

• The topic distribution is a multinomial distribution

• The topic weight distribution is a Dirichlet distribution

• The word distribution per topic is a Dirichlet distribution

A generative model suggests a probabilistic procedure to generate documents given a
distribution over topics. Given a distribution over topics, a document can be generated by
repeatedly selecting a topic from the given topic distribution and then selecting a word from
the selected topic. We now formally define the mathematical model behind LDA. We
are given a fixed set of documents $D = \{d_1, d_2, d_3, \ldots, d_n\}$, a fixed vocabulary
$W = \{w_1, w_2, w_3, \ldots, w_m\}$ and a set of topics $T = \{t_1, t_2, \ldots, t_k\}$. Let $d \in D$ denote
a random document, $w \in W$ denote a random word and $t \in T$ denote a random topic.
Let $P(d)$ be the probability of selecting a document, $P(t|d)$ the probability of selecting
topic $t$ given that document $d$ was selected, and $P(w|t)$ the
probability of selecting word $w$ given that topic $t$ was selected. The
joint probability distribution of the observed variables $(d, w)$ is

$$P(d,w) = P(d)\,P(w|d)$$

Since $w$ and $d$ are conditionally independent given $t$,

$$P(w|d) = \sum_{t=t_1}^{t_k} P(w|t)\,P(t|d) \implies P(d,w) = P(d) \sum_{t=t_1}^{t_k} P(w|t)\,P(t|d)$$

According to Bayes' rule,

$$P(d)\,P(t|d) = P(d|t)\,P(t) \implies P(d,w) = \sum_{t=t_1}^{t_k} P(t)\,P(w|t)\,P(d|t)$$
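The generative procedure and the mixture formula for $P(w|d)$ can be illustrated with a small sketch. The two topics, the four-word vocabulary and all probability values below are hypothetical toy numbers chosen for illustration; they are not taken from our data set.

```python
import random

# Hypothetical toy model: 2 topics over a 4-word vocabulary.
vocabulary = ["topic", "model", "graph", "vertex"]

# P(w|t): one word distribution per topic (each row sums to 1).
p_word_given_topic = [
    [0.5, 0.4, 0.05, 0.05],   # topic 0
    [0.05, 0.05, 0.5, 0.4],   # topic 1
]

# P(t|d): topic distribution of a single document d.
p_topic_given_doc = [0.7, 0.3]


def word_probability(word_index):
    """P(w|d) = sum over topics t of P(w|t) * P(t|d)."""
    return sum(
        p_word_given_topic[t][word_index] * p_topic_given_doc[t]
        for t in range(len(p_topic_given_doc))
    )


def generate_document(length, rng):
    """Generate words by repeatedly picking a topic, then a word from that topic."""
    topics = range(len(p_topic_given_doc))
    words = []
    for _ in range(length):
        topic = rng.choices(topics, weights=p_topic_given_doc)[0]
        word = rng.choices(vocabulary, weights=p_word_given_topic[topic])[0]
        words.append(word)
    return words


mixture = [word_probability(i) for i in range(len(vocabulary))]
sample = generate_document(10, random.Random(0))
```

Here `mixture` holds $P(w|d)$ for every word in the toy vocabulary, and `sample` is one document drawn from the same model.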

Now consider the opposite direction: given a document, we can use statistical inference to
find the topics associated with it. This is the statistical inference
problem, the inverse of the generative approach described above. Given a document, we would like
to find the topics that are most likely to have generated it.
We refer to the set of topics generated using the topic modeling method as topic classes. This
involves inferring the word distribution in the topics and the topic distribution in the
documents, given the word distribution in the documents. The LDA algorithm generates these topic
classes using statistical inference techniques based on the assumptions given earlier. The LDA
tool we used, MALLET, uses an algorithm based on Gibbs sampling [CG92] to estimate
the topic classes.
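To make the inference step concrete, the following is a minimal sketch of a collapsed Gibbs sampler for LDA. It is not MALLET's implementation, which is considerably more optimized; the function name `lda_gibbs`, the hyperparameter values `alpha` and `beta`, and the toy corpus are assumptions made for illustration only.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n(d, t)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n(t, w)
    topic_total = [0] * num_topics                              # n(t)
    assignments = []

    # Assign a random initial topic to every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            t = rng.randrange(num_topics)
            z_doc.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # P(z = k | rest) is proportional to
                # (n(d,k) + alpha) * (n(k,w) + beta) / (n(k) + V * beta).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    # Smoothed estimate of P(t|d) for each document.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(num_topics)])
    return theta


# Toy corpus: word ids 0-2 and 3-5 never occur in the same document.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
theta = lda_gibbs(docs, num_topics=2, vocab_size=6)
```

Each row of `theta` is an estimated topic distribution $P(t|d)$; a document could then be placed in the topic class with the largest weight, which is one possible assignment rule rather than the only one.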



3 Bundling 



In this section, we elaborate on the second step of our algorithm, i.e., bundling documents
in a given topic class. A topic, selected from the set of topics generated by LDA, is given
to the clustering system, which generates coherent bundles based on the selected similarity
measures. The similarity measures for our data set are defined based on extended co-authorship
and inter-textual distance.



3.1 Extended Co-authorship Dissimilarity



Extended Co-authorship Dissimilarity between two articles is defined in terms of the
similarity between the extended co-authors of the articles. The extended co-authors of an
article are defined as the union of the set of its authors and the set of authors it references.
Extended Co-authorship Similarity between two articles is defined as the Jaccard coefficient of
the extended co-author sets of the two articles:

$$SIM(A,B) = \frac{|\text{Extended Co-auth}(A) \cap \text{Extended Co-auth}(B)|}{|\text{Extended Co-auth}(A) \cup \text{Extended Co-auth}(B)|}$$

where

Extended Co-auth(A) = Extended Co-authorship of article A

Extended Co-auth(B) = Extended Co-authorship of article B

The corresponding Extended Co-authorship Dissimilarity is defined as $ExtCoauth(A,B) =
1 - SIM(A,B)$. We create a proximity matrix ExtCoauth containing the Extended
Co-authorship Dissimilarity among all the articles as $ExtCoauth = [ExtCoauth_{i,j}]_{n \times n} =
[ExtCoauth(i,j)]$.
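The definition above can be sketched in a few lines. The function names and the author identifiers (`"a1"`, `"a2"`, ...) are hypothetical, and treating two empty author sets as identical is an assumption that the definition itself does not cover.

```python
def extended_coauthors(authors, referenced_authors):
    """Union of an article's own authors and the authors it references."""
    return set(authors) | set(referenced_authors)


def ext_coauth_dissimilarity(coauth_a, coauth_b):
    """1 - Jaccard coefficient of two extended co-author sets."""
    union = coauth_a | coauth_b
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1.0 - len(coauth_a & coauth_b) / len(union)


def ext_coauth_matrix(coauthor_sets):
    """n x n proximity matrix of pairwise dissimilarities."""
    return [
        [ext_coauth_dissimilarity(a, b) for b in coauthor_sets]
        for a in coauthor_sets
    ]


# Hypothetical articles with placeholder author identifiers.
article_a = extended_coauthors(["a1", "a2"], ["a3", "a4"])
article_b = extended_coauthors(["a1"], ["a4", "a5"])
matrix = ext_coauth_matrix([article_a, article_b])
```

In this example the two sets share 2 of 5 distinct authors, so the similarity is 0.4 and the dissimilarity is 0.6.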



3.2 Inter-textual Distance 

Inter-textual distance, due to Labbé [LL01], is defined over the frequency of the common
vocabulary of the texts. It measures the relative distance of the texts from each other.
Mathematically, the inter-textual distance between two texts A and B is defined as

$$Cont(A,B) = \frac{\sum_{i \in V_A, V_{A(B)}} \left| F_{iA} - E_{iA(B)} \right|}{N_A + N_{A(B)}}$$

Here, $F_{iA}$ is the frequency of the $i$-th vocabulary item in document A, $F_{iB}$ is the frequency of
the $i$-th vocabulary item in document B, and $E_{iA(B)}$ is the frequency of the $i$-th vocabulary item in B with
mathematical expectation greater than or equal to one with respect to A. $N_A$ is the sum of
the frequencies of the vocabulary in A, $N_B$ is the sum of the frequencies of the vocabulary in B, and
$N_{A(B)}$ is the sum of the frequencies of the vocabulary in B with expectation value greater than or
equal to one. A proximity matrix Cont is constructed for all the articles in the given topic
class, containing the inter-textual distances between them.

$$E_{iA(B)} = F_{iB} \times \frac{N_A}{N_B}, \qquad N_A = \sum_{V_A} F_{iA}, \qquad N_B = \sum_{V_B} F_{iB}$$
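The following sketch shows one way the inter-textual distance could be computed for two tokenized texts. It rests on several assumptions that the definition above leaves open: A is taken to be the shorter text, $N_{A(B)}$ is taken as the sum of the retained expected frequencies, and tokenization is assumed to be done beforehand. The function name is hypothetical.

```python
from collections import Counter


def intertextual_distance(tokens_a, tokens_b):
    """Inter-textual distance between two token lists (0 = identical profile)."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = sum(freq_a.values()), sum(freq_b.values())
    if n_a == 0 or n_b == 0:
        raise ValueError("both texts must contain at least one token")

    # Assumption: A is the shorter text, so B is scaled down to A's size.
    if n_a > n_b:
        freq_a, freq_b = freq_b, freq_a
        n_a, n_b = n_b, n_a

    # Expected frequency of each word of B in a text of size N_A;
    # only words whose expectation is at least one are kept.
    scale = n_a / n_b
    expected = {w: f * scale for w, f in freq_b.items() if f * scale >= 1.0}
    n_ab = sum(expected.values())  # assumption: N_A(B) sums the kept expectations

    vocab = set(freq_a) | set(expected)
    numerator = sum(abs(freq_a.get(w, 0) - expected.get(w, 0.0)) for w in vocab)
    return numerator / (n_a + n_ab)


same = intertextual_distance(["a", "b", "a"], ["a", "b", "a"])
disjoint = intertextual_distance(["a", "a", "b"], ["c", "c", "d"])
scaled = intertextual_distance(["x", "y"], ["x", "x", "y", "y"])
```

Under these assumptions, two texts with the same relative word frequencies have distance 0, and two equally long texts with no common vocabulary have distance 1.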



3.3 Bundling 



To apply the hierarchical agglomerative clustering algorithm, we create a combined proximity
matrix D from the respective distance measures ExtCoauth and Cont, given by

$$D = \alpha \cdot ExtCoauth + (1 - \alpha) \cdot Cont, \qquad 0 < \alpha < 1$$

where $\alpha$ is the weight factor. We apply the fastcluster algorithm as in [Mul11] for hierarchical
agglomerative clustering to create $\sqrt{n}$ bundles, where $n$ is the number of articles
in the selected topic class.
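The bundling step can be sketched as follows. This is a naive pure-Python stand-in for fastcluster, not the library itself; the use of average linkage, rounding $\sqrt{n}$ up to an integer, the function names and the toy matrices are all assumptions made for illustration.

```python
import math


def combine(ext_coauth, cont, alpha):
    """D = alpha * ExtCoauth + (1 - alpha) * Cont, element by element."""
    n = len(ext_coauth)
    return [
        [alpha * ext_coauth[i][j] + (1 - alpha) * cont[i][j] for j in range(n)]
        for i in range(n)
    ]


def agglomerative_bundles(dist, num_bundles):
    """Merge the two closest clusters until num_bundles clusters remain."""
    clusters = [[i] for i in range(len(dist))]

    def average_linkage(c1, c2):
        # Assumption: average linkage; the linkage method is not fixed above.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > num_bundles:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical proximity matrices for four articles.
ext_coauth = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
cont = [
    [0.0, 0.3, 0.7, 0.8],
    [0.3, 0.0, 0.8, 0.7],
    [0.7, 0.8, 0.0, 0.2],
    [0.8, 0.7, 0.2, 0.0],
]
combined = combine(ext_coauth, cont, alpha=0.5)
num_bundles = math.ceil(math.sqrt(len(combined)))  # assumption: round sqrt(n) up
bundles = agglomerative_bundles(combined, num_bundles)
```

With these toy matrices, articles 0 and 1 form one bundle and articles 2 and 3 the other. The naive loop is cubic in the number of articles, so for a topic class of realistic size a dedicated library would be the practical choice.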



4 Experiments 

In this section, we detail the experimental protocol based on Amazon Mechanical Turk,
a crowdsourcing marketplace of Amazon. Two types of evaluation techniques are
employed to measure the quality of clustering schemes: theoretical evaluation
and user evaluation. We use user evaluation here.
Amazon Mechanical Turk (AMT) [htt] is a crowdsourcing marketplace service provided
by Amazon where users can work on small tasks that are currently difficult to achieve
using computers, i.e., work that requires human intelligence. In Mechanical Turk
terminology, tasks are called Human Intelligence Tasks (HITs), a user who provides a task is called a
Requester, and a user who works on a task is called a Worker. A HIT is a well-explained,
self-contained question of the type described earlier. A requester creates HITs and
publishes them on AMT. A requester can use qualification tests to recruit suitable workers
or filter out unskilled workers. To run the experiments, we
selected three topic classes from the set of twenty-six topic classes: Machine Learning,
Information Retrieval and Graph Theory. We
selected five bundles from these three topic classes. Each of these bundles is presented to
users to check its quality and to compare it with bundles generated using content-only
clustering.



4.1 Independent Study 

In the independent study, we measure the quality of the bundling process independently through
Worker feedback. We validate the semantics by asking the Workers to comment on
the quality of the bundles generated by our algorithm. We use a survey
questionnaire in which the Workers are asked to read the articles in the bundle and give
their feedback on the similarity of the articles in the bundle.



4.1.1 Results 



The results of the survey questions are detailed in Table 1. All 29 users who participated in
the survey for the topic Information Retrieval confirmed that the articles in the bundles are very
similar, a 100% success ratio. Of the 24 users who participated in the survey for
Graph Theory, 20 affirmed that the member articles in the bundles are similar. In the case
of Machine Learning, 84.2% of the participating workers agreed that the member articles
in the bundle are similar. Overall, the agreement ratio of the independent study is 89.1%, which
is a strong indication that our choice of clustering semantics is appropriate.



Topic                   Agreement   Non-agreement   Agreement Ratio
Information Retrieval   29          0               100%
Graph Theory            20          4               83.3%
Machine Learning        32          6               84.2%
Overall                                             89.1%



Table 1: Independent Study Results 



4.2 Comparative Study 

In the comparative study, we ask the worker to do a "side-by-side" comparison of two bundling
results, one based on content and extended co-authorship similarity and the other based
on content similarity only. The aim of the comparative study is to check whether the semantics
we use give a better result than the commonly used content-only semantics. We
employ a survey questionnaire in which we ask the worker to read the two bundles and
give each bundle the most appropriate name. At the end, they are asked to indicate which
bundle was easiest to name. Our assumption is that a diverse bundle will be difficult to name and
a bundle of similar articles will be very easy to name.

4.2.1 Results 

The results of the comparative study are given in Tables 2, 3, 4 and 5. Table 2 contains
the results of the survey questionnaire for the topic Information Retrieval. A total of
50 users per topic class participated in the evaluation. Overall, 82.7% of the workers preferred the
extended co-authorship + content based similarity over content-only similarity.



Batch    Preferred   Not Preferred
1        8           2
2        8           2
3        10          0
4        8           2
5        9           1
Result   86%         14%



Table 2: Comparative Study Results Information Retrieval 



Batch    Preferred   Not Preferred
1        8           2
2        8           2
3        9           1
4        7           3
5        9           1
Result   82%         18%



Table 3: Comparative Study Results Graph Theory 



5 Conclusion 



Our algorithm gives very promising results when used with unstructured data, and we
believe that with structured data it will yield far better results. The user study conducted on
the unstructured data set gives a very positive indication of the effectiveness of our
bundling semantics. Our algorithm can be easily extended by using other similarity
measures such as year of publication, co-authorship graphs, keywords, etc.
Table 4: Comparative Study Results Machine Learning



         Preference Ratio   Non-Preference Ratio
Result

Table 5: Comparative Study Results (Combined)